A new old algorithm, entropy sampling, and/or getting in and out of energy minima 



M.J. Thill 
Racah Institute of Physics 
The Hebrew University of Jerusalem 
91904 Jerusalem, Israel 
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I. INTRODUCTION 

With fast-growing computer technology, Monte Carlo 
(MC) simulations have with much success been used to 
study various statistical systems including neural net- 
works, problems in biology and chemistry, lattice-gauge 
theories and optimisation problems in various areas, not 
to mention statistical physics, to study the properties of 
phase transitions and critical phenomena. 

Most MC simulations concentrate on importance sam- 
pling for the canonical or microcanonical Gibbs ensemble, 
introduced by Metropolis et al 0. The thermodynamic 
average, O, of an observable 0(x) can be estimated ||] 
as 
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where x T represents a configuration at time r of a system 
with Hamiltonian H, ft = (fcT) -1 is the inverse temper- 
ature (with Boltzmann constant k), t a an averaging time 
(with t a ^> 1), and P{x) a sampling probability. If P{x) 
is chosen to be constant, very few samples contribute sig- 
nificantly to the sums in the above equation, and a very 
long time is required to get a reasonable estimate of O. 
Importance sampling comes in if one chooses P(x) as the 
Boltzmann weight exp(— (3H (x)). It is generally a good 
sampling algorithm, but can fail to access all the parts of 
the phase space in available computer time. Indeed, in 
many situations, this approach or similar ones face severe 
ergodicity problems if there exist many high barriers be- 
tween all the possible lowest- (or nearly-lowest-) energy 
configurations, as, e.g., in certain Lennard- Jones systems 
(see, e.g., HQ) and in spin glasses (see, e.g., [||-^)), to 
cite only two examples from statistical physics. 

To overcome these, at least for a large part, it has 
been suggested that it could be more efficient to recon- 
struct the Gibbs ensemble from a simulation with other 



ensembles (e.g., a so-called "multicanonical MC simula- 
tion" ) than to simulate it directly (see |9| and references 
therein) . One of these approaches goes under the name 
of the entropy sampling Monte Carlo method (ESMC). 
It works as follows. Let the probability of occurence of a 
configuration x with energy E be denoted as P(x), and 
the probability of occurence of a state with energy E as 
P(E). The term "state" stands here for the set of all con- 
figurations that have the same energy. They are related 
to each other through 



P(x) oc e-P E , 
P(E) cx N(E)e- 



I3E _ S{E)/k-0E 



(2) 



where I have introduced the entropy, S(E), of the state 
with energy E. Their number is N(E). In the Metropo- 
lis MC method yj], the canonical distribution of states 
is obtained, along with ergodicity, by a Markovian se- 
quence in which the transition probabilities, tt(x — ► x') 
and ir(x' — > x), between a pair of configurations x and x' 
are determined by the detailed balance condition 



tt{x -> x') _ exp[-/3E(x')} 
n(x' — > x) exp[— f3E(x)] 



(3) 



It can be shown rigorously that, in the case of a tradi- 
tional MC simulation, this condition ensures a simulation 
of the system with an equilibrium distribution which is 
just the Gibbs distribution (To). The ESMC method is, 
however, based on the probability distribution of states, 
in which the probability of occurence of a configuration 
with energy E is proportional to the exponential of the 
negative entropy, 



P{x) oc e ~S(E(x))/k ; 
P{E) oc N{E)e~ s ^' k 



(4) 



In a ESMC simulation the probability of occurence 
of a configuration with energy E is therefore anti- 
proportional to the number of configurations with that 
energy. In this way, the probabilities of occurence of all 
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states equal the same constant. An MC algorithm that 
does the job is one that is based on the detailed balance 
condition 

tt{x -> x') _ exp[-S(E(x'))/k] 
tt(x' ->x) ~ exp[-S(E{x))/k] ' 

In all other aspects, the formalism of the ESMC pro- 
cedure follows then the usual Metropolis procedure. It 
is therewith easy to show (using the methods exhibited, 
e.g., in [jlO|) that the ESMC algorithm simulates the sys- 
tem in such a way that all states occur with the same 
probability. Hence, the algorithm provides for a (one- 
dimensional) random walk through the system's energy 
space. 

In this spirit, Monte Carlo sampling with respect to 
unconventional ensembles has received some attention 
(see, e.g., [|| and references therein) in recent years. In 
the "multicanonical ensemble" approach jll],|l2],[l7) , one 
samples configurations such that the exact reconstruc- 
tion of canonical expectation values becomes feasible for 
a desired temperature range. Multicanonical and re- 
lated sampling has allowed considerable gains in situa- 
tions with "supercritical" slowing down, such as (i) first 
order phase transitions pl]jl2] , plj (for a recent review 
see, e.g., p3]), (h) systems with conflicting constraints, 
such as spin glasses @,|l|,|o|,|lJ or proteins Hl|. The 
reconstruction of canonical expectation values requires 
knowledge of the entropy values of the/an important part 
of the energy range (see equation ([!])), but leaves inno- 
vative freedom concerning the optimal shape Con- 
siderable practical experience exists only for algorithms 
where one samples such that (a) the probability density 
is flat in a desired energy range P(E) = const, and (b) 
each configuration of fixed energy E appears with the 
same likelihood. It should be noted that condition (b) is 
non-trivial. A simple algorithm |M exists to achieve (a), 
but which gives up (b). Exact connection to the canoni- 
cal ensemble is then lost. Such algorithms are interesting 
particularly for hard optimisation problems, but may be 
unsuitable for canonical statistical physics. The present 
paper focuses on achieving (a) and (b). To achieve a flat 
energy distribution, the appropriate unnormalised weight 
factor in equation ([j]) is S(E). However, before simula- 
tions, the entropy function S(E) is usually not known. 
Otherwise we would have solved the problem in the first 
place. Presumably, reluctance about simulations with an 
a-priori unknown weight factor is the main reason why 
the earlier umbrella sampling Q never became popular 
in statistical physics. 

In the more recent papers [9-30] it has been suggested 
to overcome this loophole by simulating with approxi- 
mate entropy values, obtained by guessing or a short 
Gibbs run, and then successively simulating with ever 
better estimates of the real entropy. The underlying as- 
sumptions have, however, up to now never been shown to 
hold rigorously, nor is there anything known on conver- 



gence properties. In this paper I propose an algorithm 
from which the entropy can be obtained in the large- 
time limit without attention "by hand" . I have applied 
the algorithm to the infinite-range, the two-dimensional 
and the three-dimensional ferromagnet, spin glass, and 
for the search of the global minimum of the generalisa- 
tion error in some on-line learning model which exhibits 
local minima besides a global one. In this letter, however, 
for illusatration and brevity, I concentrate mainly on the 
infinite-range ferromagnet. For definiteness, I will for- 
mulate the algorithm for Ising spin systems, containing 
a total number of N spins. I will enumerate the different 
states of the system by their energy, going from small- 
est to largest in value. Let E(m) denote the energy of 
the state with label to, say to = 0, 1, . . . , N — 1 (so that 
Af is the total number of energy levels). Let furthermore 
S(m; t) be the (estimated) entropy of the state with label 
to, at time t. At time t = 0, 1 initialise S(m; 0) = for all 
to. We will see below that only entropy differences matter 
in the algorithm so that the initialisation constant can in 
principle be chosen arbitrarily. The initialisation to zero 
is, however, preferable in the actual implementation of 
the algorithm on a computer, as mentioned below. 

The Free Energy Monte Carlo (FEMC) algorithm then 
works as follows. Let us assume that, at time t, the sys- 
tem is in configuration x, with energy E(x), i.e., with 
label m{x). Then, go through the following steps at time 
t+1: 

1 . Select one spin index i for which the spin Sj is con- 
sidered for flipping (sj —> — Sj). 

2. Calculate the transition probability 

n(x -> x';t + 1) 

:4( 1 _ tfflh «£M)_W£M) ( 6 ) 

to pass from configuration x to configuration x' 
which is obtained from x by effectuating the consid- 
ered spin flip. [In this present form the algorithm is 
still rather an "entropy Monte Carlo" than a "free 
energy Monte Carlo" algorithm; I will dwell on the 
full version of the latter below.] 

3. Draw a random number, r, uniformly distributed 
between zero and one. 

4. If r < tt(x — > x';t + 1), flip the spin, otherwise do 
not flip it. In any case, the configuration of spins 
obtained at the end of step 4. is counted as the 
"new configuration" , a; update . 

5. Now update the values of the entropy (fi = 
0,1,...,A/--1): 

t+1) := S(jj,; t) + e(i)5^ m ( x ) , (7) 
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where e(t) is a pre-chosen positive function which is 
sufficiently small in the large-time limit, m(x) the 
label of energy of the configuration x, and 5 denotes 
the Kronecker symbol. 

6. Go to 1. or end. 

Let us just note here that by the choice of tt(x — > 
x'\t + 1) as above, 1 ensured that the simulation veri- 
fies a detailed balance condition "locally", i.e., at every 
time step. Furthermore, it can be proven |35j that the 
algorithm converges indeed towards the entropy function 
of the system in the infinite-time limit. Let us finally 
note that a fter every time step, one should imagine, us- 
ing equation (0) , that the "zero" (baseline of the entropy 
values) has been shifted upwards. The entropy may then 
be obtained as a time average over instantaneous entropy 
values, and normalising these values with the help of the 
total number of configurations, 2 N . 

The above version of the algorithm does not deserve 
to be called an FEMC algorithm just yet, as only the 
entropy enters. However, if one wants to detect the min- 
ima (or maxima) in the energy space, it may be useful 
to change the transition probability of the algorithm to 
read: 

n(x -> x';t+ 1) 

:= \ (l - tanh 5(m(:r);tK ' g(m(a: ' ) 2 ;t) ~ /3(i?(:r) ~- E(:r,)) ^ 

(8) 

where now the transition probability does not depend 
solely on (instantaneous) entropy differences, but on (in- 
stantaneous) free energy differences, (3 being the "inverse 
temperature" as usual. If f3 is large, the temperature is 
small, and the system stays preferably in configurations 
with low energy (if one were to take (3 small or even neg- 
ative (!) one stays obviously preferably in configurations 
with high energy). This can be illustrated in particu- 
lar at the beginning of the algorithm, when the entropy 
differences are zero, and one performs a gradient descent 
algorithm towards a local or global minimum out of which 
one is then taken by a gradual increase in (instantaneous) 
entropy differences. In the following illustrations of the 
algorithm, I will, however, consider again the /3 = case 
only, for simplicity. 

I have applied the algorithm to the infinite-range, the 
two-dimensional and the three-dimensional ferromagnet 
to see how well the simplest version of the algorithm (the 
"/3 = FEMC algorithm") performs on obtaining the 
entropy values of the considered systems and on over- 
coming energy barriers. More specifically, in the case of 
the infinite-range ferromagnet, where I know the entropy 
(or number of configurations) exactly, I have investigated 
the convergence to the correct values of the entropy as 
well as the passage times ("tunneling times") between 
the "all-spins up" and "all-spins down" ground states. 



The values of e that I use are between e = 10 _1 and 
e = 10~ 3 . Certainly, e should tend to zero with increas- 
ing t to obtain even more accurate values of the entropy. 
However, the smaller e is already in the earlier stages of 
the algorithm, the longer it takes to reach the asymp- 
totic stage. If one takes e too small at the start of the 
algorithm, one risks to never leave, during the time of 
the simulation, the regime where effectively one samples 
according to performing a random walk in configuration 
space, as the differences in the entropy values will stay 
too small in the available simulation time. A good way to 
measure whether one has reached the asymptotic regime 
of the algorithm yet or not is to keep track of the sam- 
pled energies: if the histogram of the energies, averaged 
over a long enough period of time, is flat, asymptotics 
are reached. 

Some comments on the large-e limit are also in place. 
For e of order 1 or larger, the transition probability (||) 
of accepting a move becomes essentially or 1, as the 
differences in estimated entropy values for the different 
levels become larger than 1, hence the argument of the 
tanh. This has the following consequences. If, e.g., one 
is at time t, say, in a configuration of the level v(t), and 
the values of the estimated entropy values at the adjacent 
levels v(t) ± 1 are S(v(t) - 1) < S(v(t)) < S(v{t) + 1), 
then moves are (essentially) accepted iff the level of the 
considered update, v Mpdatc , is ^update = — The & 1~ 
gorithm can be viewed as putting a brick of height e at 
every time step onto a wall which is building up during 
the process. If the height of the wall at the adjacent 
place (level) whereto the move at time t + 1 is consid- 
ered is lower, the move is accepted with probability ~ 1, 
if the wall is of equal height, the move is accepted with 
probability ^, and else it is (essentially always) rejected. 
It is easy to notice that on average the wall will be of 
equal height for all levels if one substracts a running (in- 
creasing) baseline. This leads to the observed fact that 
all energy levels occur with equal probability, albeit the 
fact that the weight factors in the algorithm are not pro- 
portional to the true entropy values (on average). 

I have considered systems which contained N = 2™ 
spins where n = 2,3, ... ,9. I have compared runs where 
the "tunneling times" where measured from the begin- 
ning with ones where the tunneling times of the first 
x ■ N 2 (x = 1, 2, . . . , 10) Monte Carlo steps (MCS; de- 
fined as usual as the time needed to update all N spins 
in the system) were not taken into account. The (ex- 
pected) experience from these runs is that the distribu- 
tion of tunneling times, its mean (r m ) and random mean 
square (rms) value (r rma ) remained essentially unaltered. 
However, to be on the safe side, I have not counted the 
tunnclings observed during the first 10 • A/" 2 MCS in the 
runs whose results are displayed in the following. In all 
of the different runs, the histogram of the energies, aver- 
aged at the same time than the (instantaneous) entropy 
values, is flat, i.e., fluctuates around the mean number 
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of sampled configurations per energy to within at most a 
small fraction of a percent for small system sizes and up 
to at most ±5% for the largest systems considered. 

The results shown in figure 1 have been obtained with 
a total number of 10 6 + 10 • N 2 MCS. I have fixed e to 
equal 10 -3 , 10 -2 , and 10 _1 , respectively. This implies 
that the values r m and r rma , the mean and the rms value 
of the tunneling times, have been obtained for e = 1CP 3 
from 149338 values for n = 4 to 659 values for n = 9, for 
e = 1(T 2 from 148726 values for n = 4 to 1218 values for 
n = 9, for e = 1CT 1 from 146682 values for n = 4 to 1826 
values for n — 9. To check for statistical reliability of the 
data, I have also performed slightly shorter and longer 
runs. The error bars that I have got from these runs, at 
fixed averaging times (of the instantaneous entropy), are 
smaller than the size of the symbols in the figures. The 
distributions for the tunneling times, r, themselves are, 
however, intrinsically very broad (their width is of the 
order of their mean) with an accordingly long tail, possi- 
bly power-law-like. In the runs displayed in the figures of 
this paper, at most 432 tunneling events with times above 
5 • r m occured for N = 2, this number rapidly decreasing 
with system size to at most 2 such events for N = 512. 
The statistics for the larger systems may, indeed, be in- 
sufficiant because of the long tail of the distribution (the 
effect of a reduction of the measured r m can clearly be 
seen in figure 1 for the runs with e = 10~ 2 and 256 re- 
spectively 512 spins). However, the resulting error should 
not alter the results significantly. 

Figure 1 shows the increase of T m with the number of 
the spins in the system on a log-log scale. 



FIG. 1. Log- log-plot of the mean "tunneling times" r m 
versus the number of spins, N = 2" (n = 2, 3, . . . , 9), in 
the system for the infinite-range ferromagnet. The dia- 
monds stem from simulations with e = 10 -3 , the crosses 
from ones with e = 10 -2 , and the open squares from ones 
with e = KT 1 . 

I have fitted straight lines (by eyesight) to the data 
points in figure 1, corresponding to the fits r m = c m iV' 5m , 
for the different values of e. Because of higher statistical 



reliablitity of the results of the smaller size systems these 
have been taken account more than the ones of the larger 
size systems. The results for the fit parameters are 



ln(c„ 
ln(c„ 
ln(c„ 



■ 0.36 

■ 0.42 

■ 0.54 



i 2.10 
i 2.07 
' 1.99 



e = 10" 3 
e = 10- 2 
e = 1Q- 1 



(9) 



Remember that the exponent for a random walk in en- 
ergy space is 8 = 2 from the general results on one- 
dimensional random walks. In the above runs, we ob- 
tain that the width of the distribution is of the order of 
its mean; this can be noted easily already for 2V = 1, 
where one knows the (power-law-like) distribution of the 
tunneling times and its properties analytically. 

In all of the above simulations, I have compared the 
obtained values for the entropy with the true ones (that 
can be easily obtained exactly [see, e.g., P5fl l. Depending 
on the value of e, on the number of times that I averaged 
over the (estimated instantaneous) entropy values and 
on the total time of the simulation, I got more and less 
accurate results. In any case, for the runs of figure 1, 
the error was typically of the order of e. In figure 1, a 
comparison of the entropy values, obtained for a system 
containing 128 spins, using the algorithm with e = 10~ 2 
and running it for a total of 50960 MCS, with the exact 
values is shown. 




FIG. 2. Comparison of the entropy values, S(m), ob- 
tained through the FEMC algorithm and e = 10~ 2 
(crosses) with the true entropy values, and comparison 
of the ones obtained through a random walk in config- 
uration space (diamonds) with the true entropy values, 
for a system with 256 spins. The FEMC algorithm has 
been run in total for 50960 MCS, the one of the random 
sampling for 1040960 MCS. 

As can be seen from figure 2, the matching of the values 
(of the number of configurations, not of the entropy !) is 
quite impressive, taking into account the fact that there 
are 2 128 configurations. I have also compared it in the fig- 
ure to the values obtained by a run with a larger number 
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of MCS, namely 1040960 MCS, but performing a local 
random walk in configuration space. In this last run, I 
have counted the configurations during the run and nor- 
malised the sum of all the hits to 2 128 . It is easily seen 
from the figure that, in contrast to the FEMC algorithm 
(with much fewer MCS !), ergodicity has been lost, con- 
figurations with energy E(m) for m < 33 have not even 
been sampled ! 

In conclusion, in this paper, I have introduced a new 
algorithm in the spirit of the proposals of simulations 
with "multicanonical ensembles" . I have simulated the 
infinite-range ferromagnet to obtain the entropy function 
and to investigate the "tunneling times" of getting from 
one energy minimum to another one. The results are very 
encouraging: (i)The scaling of the "tunneling times" with 
the system size are roughly the ones of a random walk in 
energy space; (ii) the entropy function of the considered 
systems could be obtained roughly within errors of or- 
der e in the used computer time; (iii) more importantly, 
ergodicity could be retained in all the considered cases 
in the used computer time (~ 10 6 MCS at most). In 
particular, the last fact, the retaining of the ergodicity, 
should allow for a calculation of physical quantities, such 
as correlation functions, through equation (|l|) near zero 
temperature where conventional MC simulations fail. It 
is also interesting to use the full FEMC algorithm (with 
nonvanishing j3). More results with the latter shall be 
published elsewhere |3j| . It is particularly interesting to 
determine the best choice of the parameters e and (3 and 
of their ratio to overcome energy barriers in the least 
amount of time and to explore the energy space for min- 
ima or more generally extrema. Also, more analytical 
details and properties of the algorithm can be obtained 
for the case of the infinite-range ferromagnet |35| . It shall 
be interesting to apply the algorithm in the cases where 
hitherto ergodicity problems have not allowed to obtain 
results. 

One last application of the algorithm should be men- 
tioned here. Constructing a general model of on-line 
learning is an important challenge in the theory of learn- 
ing and its application. A plausible definition of the goal 
of supervised learning from examples is to find a weight 
vector w that minimises the generalisation error, e g (w). 
In an on-line learning algorithm, the learner receives a 
single new example at each time step and is unable to 
store previous examples in memory. The conventional 
on-line algorithm is based on the gradient of the instan- 
taneous error j3^j3£j. For a sufficiently small learning 
rate, it converges to a local minimum of e 9 (w) but not 
necessarily to the global one. More importantly, it is not 
applicable to learning of boolean functions or of other dis- 
crete valued functions which are extremely useful for de- 
cision and classification tasks. Recently, an on-line Gibbs 
algorithm has been proposed p0| as the first on-line algo- 
rithm that guarantees convergence to the minimal gener- 
alisation error for non-smooth systems, in particular for 



systems with discrete valued outputs or threshold hidden 
units. The price that is paid is an increased complexity 
of the computation at each presentation of examples. In 
particular, for systems in which the generalisation error 
has local minima, the on-line Gibbs algorithm may re- 
quire a slow annealing schedule of the temperature vari- 
able, used in the algorithm, which might yield a slow 
global convergence rate. In addition, the algorithm re- 
lics on the possibility of updating the weights by small 
increments. Consequently, it is inapplicable to systems 
with discrete valued weights. In the case of the FEMC 
algorithm, gradient descent and escaping from minima 
can be combined naturally (see equation (||)) so that the 
FEMC may offer a valuable alternative to optimise the 
learning procedure. Serious questions concerning on-line 
learning are still open, most importantly the one is on the 
global convergence rate, which is yet unknown in general. 
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